Introduction

In this report, we will analyze the TelcoChurn dataset, which contains data on 7043 customers of a telecommunication company. The dataset has various variables that describe the customers’ service and billing information along with their usage patterns as categorical data. One of the main variables of interest is Churn, which indicates whether the customer left the company or not at the time of data collection. The goal of this report is to explore the factors that influence customer churn and to provide recommendations for improving customer retention.

Loading Required Libraries

library(ggplot2)
library(plotly)
library(caret)
library(factoextra)
library(dplyr)
library(fpc)
library(dbscan)
library(Hmisc)
library(corrplot)
library(missForest)
library(cluster)
library(knitr)
library(FNN)
library(solitude)

Loading Data

telco_raw <- read.csv("TelcoChurn.csv")

str(telco_raw)
## 'data.frame':    7043 obs. of  22 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...
##  $ ChurnYes        : int  0 0 1 0 1 1 0 0 1 0 ...
telco_raw <- telco_raw[,-1] #removing first column #the column of customerID is not important so i will drop it

Let’s see if our data has missing values.

colSums(is.na(telco_raw))
##           gender    SeniorCitizen          Partner       Dependents 
##                0                0                0                0 
##           tenure     PhoneService    MultipleLines  InternetService 
##                0                0                0                0 
##   OnlineSecurity     OnlineBackup DeviceProtection      TechSupport 
##                0                0                0                0 
##      StreamingTV  StreamingMovies         Contract PaperlessBilling 
##                0                0                0                0 
##    PaymentMethod   MonthlyCharges     TotalCharges            Churn 
##                0                0               11                0 
##         ChurnYes 
##                0

Imputation for Missing Values

Upon examination, we notice that 11 customers in our dataset have missing values for “Total charges” column. This corresponds to approximately 0.01% of the total values (21 * 7043 = 147903) in the dataset. Given this proportion, the impact on the overall variance is deemed minimal, whether these values are imputed or left as is. Consequently, opting for a sophisticated imputation method is considered unnecessary. Instead, a pragmatic approach is taken by choosing to impute these missing values with complete case imputation. This decision serves to simplify the process, reducing computational complexity and resource requirements.

telco_imp <- na.omit(telco_raw)
str(telco_imp)
## 'data.frame':    7032 obs. of  21 variables:
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...
##  $ ChurnYes        : int  0 0 1 0 1 1 0 0 1 0 ...
##  - attr(*, "na.action")= 'omit' Named int [1:11] 489 754 937 1083 1341 3332 3827 4381 5219 6671 ...
##   ..- attr(*, "names")= chr [1:11] "489" "754" "937" "1083" ...

Preliminary Data Visualization

We kick off our analysis with a visual exploration of key variables in the TelcoChurn dataset. Let’s begin by examining the data through informative visualizations to uncover initial insights and patterns that may illuminate the factors influencing customer churn. This will be essential in formulating initial hypotheses for our data and predictions.

Target Column - Churn

Let’s see the value counts of our target or reference.

props <- as.data.frame(prop.table(table(telco_imp$Churn))*100)
colnames(props) <- c("Churn Status", "Percentage")
props
##   Churn Status Percentage
## 1           No    73.4215
## 2          Yes    26.5785
# Create a bar plot 
ggplot(props, aes(x = `Churn Status`, y = Percentage, fill = `Churn Status`)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Churn Rate",
       x = "Churn Status",
       y = "Percentage") +
    scale_fill_manual(values = c("lightgreen", "steelblue"))

Here, we can see that the churn rate is about 26.54% in our actual dataset.

Total Charges, Monthly Charges and Tenure - Churn Rate Plot

plot_tc <- ggplot(telco_imp, aes(x = TotalCharges, fill = Churn)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Total Charges by Churn Status",
       x = "Total Charges",
       y = "Density") +
  scale_fill_manual(values = c("No" = "green", "Yes" = "steelblue"))

plot_mc <- ggplot(telco_imp, aes(x = MonthlyCharges, fill = Churn)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Monthly Charges by Churn Status",
       x = "Monthly Charges",
       y = "Density") +
  scale_fill_manual(values = c("No" = "green", "Yes" = "steelblue"))

plot_tenure <- ggplot(telco_imp, aes(x = tenure, fill = Churn)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Tenure by Churn Status",
       x = "Tenure",
       y = "Density") +
  scale_fill_manual(values = c("No" = "green", "Yes" = "steelblue"))

annotations <- list( 
  list( 
    x = 0.15,  
    y = -0.1,  
    text = "Total Charges ($)",  
    xref = "paper",  
    yref = "paper",  
    xanchor = "center",  
    yanchor = "bottom",  
    showarrow = FALSE 
  ),  
  list( 
    x = 0.5,  
    y = -0.1,  
    text = "Monthly Charges ($)",  
    xref = "paper",  
    yref = "paper",  
    xanchor = "center",  
    yanchor = "bottom",  
    showarrow = FALSE 
  ),  
  list( 
    x = 0.84,  
    y = -0.1,  
    text = "Tenure (in months)",  
    xref = "paper",  
    yref = "paper",  
    xanchor = "center",  
    yanchor = "bottom",  
    showarrow = FALSE 
  )
)

plotly::subplot(plot_tc, plot_mc, plot_tenure) %>%
  layout(annotations = annotations, title = "Density of Churn Across Total Charges, Monthly Charges, and Tenure", titlefont = list(size = 16))

Churn Rate by Senior Citizens

seniors <- as.factor(telco_imp$SeniorCitizen)

ggplot(telco_imp, aes(x = seniors, fill = Churn)) +
  geom_bar(position = "dodge") +
  labs(title = "Churn Rate by SeniorCitizen",
       x = "Senior Citizen",
       y = "Count") +
  scale_fill_manual(values = c("No" = "lightgreen", "Yes" = "steelblue"))

## Churn Rate by Contract Type

ggplot(telco_imp, aes(x = Contract, fill = Churn)) +
  geom_bar(position = "dodge") +
  labs(title = "Churn Rate by Contract Type",
       x = "Contract Type",
       y = "Count") +
    scale_fill_manual(values = c("No" = "lightgreen", "Yes" = "steelblue"))

Churn Rate by Gender

ggplot(telco_imp, aes(x = gender, fill = Churn)) +
  geom_bar(position = "dodge") +
  labs(title = "Churn Rate by Gender",
       x = "Gender",
       y = "Count") +
  scale_fill_manual(values = c("No" = "lightgreen", "Yes" = "steelblue"))

Initial Hypothesis

Analyzing the above data visualizations, we can make the following initial hypothses:

  • Gender Influence: Gender appears to have a negligible impact on churn rates.
  • Seniority Impact: The churn rate for senior citizens is approximately twice as high compared to younger demographics.
  • Contract Duration: Customers opting for month-to-month contracts exhibit higher churn rates in contrast to those with yearly contracts.
  • Monthly and Total Charges Dynamics: A notable trend indicates that the churn rate tends to increase when monthly charges are high. However, customers with higher total charges seem to be less inclined to leave the company.
  • Tenure: Customers with longer tenure are less inclined to leave the company as well.

These hypotheses provide a preliminary foundation for further investigation and analysis.

Data Cleaning, Variable Selection, and Scaling The Data

Now that we have established our initial hypothesis using the raw data, we can start our analysis. Next, we’ll streamline our dataset for unsupervised learning models by excluding the dependent variables. We will, however, later use the ChurnYes variable, which is a binary variable indicating whether the customer left the company or not, in the confusion matrix analysis as reference.

To enhance our dataset’s suitability for cluster analysis, we’ll also transform categorical/factor variables into dummy variables. This not only ensures compatibility but also facilitates the scaling process for subsequent analysis.

telco_clean <- subset(telco_imp, select = -c(Churn, ChurnYes))
telco_clean <- as.data.frame(model.matrix(~ . - 1, data = telco_clean))
str(telco_clean)
## 'data.frame':    7032 obs. of  31 variables:
##  $ genderFemale                        : num  1 0 0 0 1 1 0 1 1 0 ...
##  $ genderMale                          : num  0 1 1 1 0 0 1 0 0 1 ...
##  $ SeniorCitizen                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PartnerYes                          : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ DependentsYes                       : num  0 0 0 0 0 0 1 0 0 1 ...
##  $ tenure                              : num  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneServiceYes                     : num  0 1 1 0 1 1 1 0 1 1 ...
##  $ MultipleLinesNo phone service       : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ MultipleLinesYes                    : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ InternetServiceFiber optic          : num  0 0 0 0 1 1 1 0 1 0 ...
##  $ InternetServiceNo                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityNo internet service   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityYes                   : num  0 1 1 1 0 0 0 1 0 1 ...
##  $ OnlineBackupNo internet service     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineBackupYes                     : num  1 0 1 0 0 0 1 0 0 1 ...
##  $ DeviceProtectionNo internet service : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DeviceProtectionYes                 : num  0 1 0 1 0 1 0 0 1 0 ...
##  $ TechSupportNo internet service      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TechSupportYes                      : num  0 0 0 1 0 0 0 0 1 0 ...
##  $ StreamingTVNo internet service      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingTVYes                      : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ StreamingMoviesNo internet service  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingMoviesYes                  : num  0 0 0 0 0 1 0 0 1 0 ...
##  $ ContractOne year                    : num  0 1 0 1 0 0 0 0 0 1 ...
##  $ ContractTwo year                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PaperlessBillingYes                 : num  1 0 1 0 1 1 1 0 1 0 ...
##  $ PaymentMethodCredit card (automatic): num  0 0 0 0 0 0 1 0 0 0 ...
##  $ PaymentMethodElectronic check       : num  1 0 0 0 1 1 0 0 1 0 ...
##  $ PaymentMethodMailed check           : num  0 1 1 0 0 0 0 1 0 0 ...
##  $ MonthlyCharges                      : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges                        : num  29.9 1889.5 108.2 1840.8 151.7 ...
colSums(is.na(telco_clean)) #checking
##                         genderFemale                           genderMale 
##                                    0                                    0 
##                        SeniorCitizen                           PartnerYes 
##                                    0                                    0 
##                        DependentsYes                               tenure 
##                                    0                                    0 
##                      PhoneServiceYes        MultipleLinesNo phone service 
##                                    0                                    0 
##                     MultipleLinesYes           InternetServiceFiber optic 
##                                    0                                    0 
##                    InternetServiceNo    OnlineSecurityNo internet service 
##                                    0                                    0 
##                    OnlineSecurityYes      OnlineBackupNo internet service 
##                                    0                                    0 
##                      OnlineBackupYes  DeviceProtectionNo internet service 
##                                    0                                    0 
##                  DeviceProtectionYes       TechSupportNo internet service 
##                                    0                                    0 
##                       TechSupportYes       StreamingTVNo internet service 
##                                    0                                    0 
##                       StreamingTVYes   StreamingMoviesNo internet service 
##                                    0                                    0 
##                   StreamingMoviesYes                     ContractOne year 
##                                    0                                    0 
##                     ContractTwo year                  PaperlessBillingYes 
##                                    0                                    0 
## PaymentMethodCredit card (automatic)        PaymentMethodElectronic check 
##                                    0                                    0 
##            PaymentMethodMailed check                       MonthlyCharges 
##                                    0                                    0 
##                         TotalCharges 
##                                    0

Great, we have taken care of missing values and have also converted all factor/categorical variables into dummy variables.
However, we do see that there are spaces and characters(parentheses) that R might find unusual. This may cause errors in our analysis and coding later on. Let’s take care of it now.

colnames(telco_clean) <- gsub("\\(automatic\\)", "", colnames(telco_clean)) #the word automatic adds no meaning or value to payment method variable
colnames(telco_clean) <- gsub(" ","_",colnames(telco_clean)) #sustituting space with underscore
str(telco_clean)
## 'data.frame':    7032 obs. of  31 variables:
##  $ genderFemale                       : num  1 0 0 0 1 1 0 1 1 0 ...
##  $ genderMale                         : num  0 1 1 1 0 0 1 0 0 1 ...
##  $ SeniorCitizen                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PartnerYes                         : num  1 0 0 0 0 0 0 0 1 0 ...
##  $ DependentsYes                      : num  0 0 0 0 0 0 1 0 0 1 ...
##  $ tenure                             : num  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneServiceYes                    : num  0 1 1 0 1 1 1 0 1 1 ...
##  $ MultipleLinesNo_phone_service      : num  1 0 0 1 0 0 0 1 0 0 ...
##  $ MultipleLinesYes                   : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ InternetServiceFiber_optic         : num  0 0 0 0 1 1 1 0 1 0 ...
##  $ InternetServiceNo                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityNo_internet_service  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineSecurityYes                  : num  0 1 1 1 0 0 0 1 0 1 ...
##  $ OnlineBackupNo_internet_service    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OnlineBackupYes                    : num  1 0 1 0 0 0 1 0 0 1 ...
##  $ DeviceProtectionNo_internet_service: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DeviceProtectionYes                : num  0 1 0 1 0 1 0 0 1 0 ...
##  $ TechSupportNo_internet_service     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TechSupportYes                     : num  0 0 0 1 0 0 0 0 1 0 ...
##  $ StreamingTVNo_internet_service     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingTVYes                     : num  0 0 0 0 0 1 1 0 1 0 ...
##  $ StreamingMoviesNo_internet_service : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ StreamingMoviesYes                 : num  0 0 0 0 0 1 0 0 1 0 ...
##  $ ContractOne_year                   : num  0 1 0 1 0 0 0 0 0 1 ...
##  $ ContractTwo_year                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PaperlessBillingYes                : num  1 0 1 0 1 1 1 0 1 0 ...
##  $ PaymentMethodCredit_card_          : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ PaymentMethodElectronic_check      : num  1 0 0 0 1 1 0 0 1 0 ...
##  $ PaymentMethodMailed_check          : num  0 1 1 0 0 0 0 1 0 0 ...
##  $ MonthlyCharges                     : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges                       : num  29.9 1889.5 108.2 1840.8 151.7 ...

Min-Max Scaling

Given the diverse range of categorical (dummy) variables, min-max scaling is employed to transform these features into a common scale, ensuring each variable contributes equally to our analyses. This normalization technique brings uniformity to the data, preventing the undue influence of variables with different scales and fostering more robust and reliable results in subsequent analyses

normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

telco_scaled <- as.data.frame(lapply(telco_clean, normalize))

summary(telco_scaled)
##   genderFemale      genderMale     SeniorCitizen      PartnerYes    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :1.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.4953   Mean   :0.5047   Mean   :0.1624   Mean   :0.4825  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  DependentsYes        tenure       PhoneServiceYes 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.1127   1st Qu.:1.0000  
##  Median :0.0000   Median :0.3944   Median :1.0000  
##  Mean   :0.2985   Mean   :0.4426   Mean   :0.9033  
##  3rd Qu.:1.0000   3rd Qu.:0.7606   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  MultipleLinesNo_phone_service MultipleLinesYes InternetServiceFiber_optic
##  Min.   :0.0000                Min.   :0.0000   Min.   :0.0000            
##  1st Qu.:0.0000                1st Qu.:0.0000   1st Qu.:0.0000            
##  Median :0.0000                Median :0.0000   Median :0.0000            
##  Mean   :0.0967                Mean   :0.4219   Mean   :0.4403            
##  3rd Qu.:0.0000                3rd Qu.:1.0000   3rd Qu.:1.0000            
##  Max.   :1.0000                Max.   :1.0000   Max.   :1.0000            
##  InternetServiceNo OnlineSecurityNo_internet_service OnlineSecurityYes
##  Min.   :0.0000    Min.   :0.0000                    Min.   :0.0000   
##  1st Qu.:0.0000    1st Qu.:0.0000                    1st Qu.:0.0000   
##  Median :0.0000    Median :0.0000                    Median :0.0000   
##  Mean   :0.2162    Mean   :0.2162                    Mean   :0.2865   
##  3rd Qu.:0.0000    3rd Qu.:0.0000                    3rd Qu.:1.0000   
##  Max.   :1.0000    Max.   :1.0000                    Max.   :1.0000   
##  OnlineBackupNo_internet_service OnlineBackupYes 
##  Min.   :0.0000                  Min.   :0.0000  
##  1st Qu.:0.0000                  1st Qu.:0.0000  
##  Median :0.0000                  Median :0.0000  
##  Mean   :0.2162                  Mean   :0.3449  
##  3rd Qu.:0.0000                  3rd Qu.:1.0000  
##  Max.   :1.0000                  Max.   :1.0000  
##  DeviceProtectionNo_internet_service DeviceProtectionYes
##  Min.   :0.0000                      Min.   :0.0000     
##  1st Qu.:0.0000                      1st Qu.:0.0000     
##  Median :0.0000                      Median :0.0000     
##  Mean   :0.2162                      Mean   :0.3439     
##  3rd Qu.:0.0000                      3rd Qu.:1.0000     
##  Max.   :1.0000                      Max.   :1.0000     
##  TechSupportNo_internet_service TechSupportYes   StreamingTVNo_internet_service
##  Min.   :0.0000                 Min.   :0.0000   Min.   :0.0000                
##  1st Qu.:0.0000                 1st Qu.:0.0000   1st Qu.:0.0000                
##  Median :0.0000                 Median :0.0000   Median :0.0000                
##  Mean   :0.2162                 Mean   :0.2901   Mean   :0.2162                
##  3rd Qu.:0.0000                 3rd Qu.:1.0000   3rd Qu.:0.0000                
##  Max.   :1.0000                 Max.   :1.0000   Max.   :1.0000                
##  StreamingTVYes   StreamingMoviesNo_internet_service StreamingMoviesYes
##  Min.   :0.0000   Min.   :0.0000                     Min.   :0.0000    
##  1st Qu.:0.0000   1st Qu.:0.0000                     1st Qu.:0.0000    
##  Median :0.0000   Median :0.0000                     Median :0.0000    
##  Mean   :0.3844   Mean   :0.2162                     Mean   :0.3884    
##  3rd Qu.:1.0000   3rd Qu.:0.0000                     3rd Qu.:1.0000    
##  Max.   :1.0000   Max.   :1.0000                     Max.   :1.0000    
##  ContractOne_year ContractTwo_year PaperlessBillingYes
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000     
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000     
##  Median :0.0000   Median :0.0000   Median :1.0000     
##  Mean   :0.2093   Mean   :0.2396   Mean   :0.5927     
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000     
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000     
##  PaymentMethodCredit_card_ PaymentMethodElectronic_check
##  Min.   :0.0000            Min.   :0.0000               
##  1st Qu.:0.0000            1st Qu.:0.0000               
##  Median :0.0000            Median :0.0000               
##  Mean   :0.2163            Mean   :0.3363               
##  3rd Qu.:0.0000            3rd Qu.:1.0000               
##  Max.   :1.0000            Max.   :1.0000               
##  PaymentMethodMailed_check MonthlyCharges    TotalCharges    
##  Min.   :0.0000            Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000            1st Qu.:0.1725   1st Qu.:0.04416  
##  Median :0.0000            Median :0.5184   Median :0.15909  
##  Mean   :0.2281            Mean   :0.4632   Mean   :0.26131  
##  3rd Qu.:0.0000            3rd Qu.:0.7126   3rd Qu.:0.43572  
##  Max.   :1.0000            Max.   :1.0000   Max.   :1.00000

Should we do PCA to reduce dimensionality? Why or why not?

Let’s do a Primary Component Analysis (PCA) and see the results to determine if it will be helpful.

dim(telco_scaled)
## [1] 7032   31

This shows that our dataset has 7032 rows and 31 variables as of now.

Primary Component Analysis (PCA)

We will perform a PCA and select the number of principal components that explain about 90% of the total variance in the data. This will allow us to reduce the dimensionality and complexity of the data while retaining most of the information.

telco.pca <- prcomp(telco_scaled)
summary(telco.pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.3109 0.7871 0.7074 0.65010 0.52230 0.48191 0.46287
## Proportion of Variance 0.3005 0.1083 0.0875 0.07391 0.04771 0.04061 0.03747
## Cumulative Proportion  0.3005 0.4088 0.4963 0.57022 0.61793 0.65854 0.69601
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.45467 0.43657 0.42126 0.39339 0.38384 0.36974 0.36393
## Proportion of Variance 0.03615 0.03333 0.03103 0.02706 0.02576 0.02391 0.02316
## Cumulative Proportion  0.73216 0.76549 0.79652 0.82358 0.84935 0.87326 0.89642
##                           PC15    PC16    PC17    PC18   PC19    PC20    PC21
## Standard deviation     0.35445 0.33180 0.31313 0.29986 0.2789 0.22867 0.18342
## Proportion of Variance 0.02197 0.01925 0.01715 0.01572 0.0136 0.00914 0.00588
## Cumulative Proportion  0.91839 0.93764 0.95479 0.97051 0.9841 0.99326 0.99914
##                           PC22     PC23      PC24      PC25      PC26      PC27
## Standard deviation     0.06951 0.009603 3.297e-14 5.943e-16 4.963e-16 1.245e-16
## Proportion of Variance 0.00085 0.000020 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  0.99998 1.000000 1.000e+00 1.000e+00 1.000e+00 1.000e+00
##                             PC28      PC29      PC30      PC31
## Standard deviation     1.245e-16 1.245e-16 1.245e-16 6.659e-17
## Proportion of Variance 0.000e+00 0.000e+00 0.000e+00 0.000e+00
## Cumulative Proportion  1.000e+00 1.000e+00 1.000e+00 1.000e+00

In the above summary, we see that we can obtain about 90% of the variance using 15 Primary components. We have effectively reduced our dataset from 31 columns to 15. Let’s create a new dataset with these 15 PCs for our analysis.

PCA Visualization

library(ggbiplot)
ggbiplot(telco.pca)

Here we see that ContractTwo_year, DependentsYes, PhoneServiceYes, PaymentMethods, along with some other variables contribute the highest to PC1. Given that PC1 alone explains about 30.1% of the variance in our dataset, we can also say that these variables are substantial in characterizing our customers.

Cumulative Variance Plot

cumulativepro <- cumsum(telco.pca$sdev^2 / sum(telco.pca$sdev^2))
plot(cumulativepro, type = "l", xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot")
abline(h = 0.9, col="blue", lty=5)

The cumulative variance plot further illustrates our choice of 15 PCs.

PCA Dataframe

telco.pca_df <- as.data.frame(telco.pca$x[,1:15])
summary(telco.pca_df)
##       PC1               PC2                PC3               PC4          
##  Min.   :-1.7554   Min.   :-1.59630   Min.   :-0.7990   Min.   :-2.11968  
##  1st Qu.:-0.9001   1st Qu.:-0.59522   1st Qu.:-0.7189   1st Qu.:-0.40535  
##  Median :-0.4684   Median :-0.04814   Median : 0.6378   Median : 0.09843  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.1208   3rd Qu.: 0.57155   3rd Qu.: 0.6940   3rd Qu.: 0.47937  
##  Max.   : 2.6244   Max.   : 2.11816   Max.   : 0.8220   Max.   : 1.29461  
##       PC5                PC6               PC7                PC8          
##  Min.   :-1.31894   Min.   :-1.7439   Min.   :-1.29676   Min.   :-1.57258  
##  1st Qu.:-0.40386   1st Qu.:-0.2769   1st Qu.:-0.22153   1st Qu.:-0.27902  
##  Median :-0.03369   Median :-0.0146   Median : 0.09828   Median : 0.02988  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.39564   3rd Qu.: 0.3075   3rd Qu.: 0.31956   3rd Qu.: 0.30623  
##  Max.   : 1.50058   Max.   : 1.3817   Max.   : 1.12561   Max.   : 1.32613  
##       PC9               PC10               PC11                PC12         
##  Min.   :-1.1800   Min.   :-1.34761   Min.   :-1.294965   Min.   :-1.42558  
##  1st Qu.:-0.2879   1st Qu.:-0.28289   1st Qu.:-0.249400   1st Qu.:-0.25347  
##  Median : 0.0266   Median : 0.01177   Median : 0.009472   Median :-0.01144  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.00000  
##  3rd Qu.: 0.2497   3rd Qu.: 0.27639   3rd Qu.: 0.236322   3rd Qu.: 0.24679  
##  Max.   : 1.2822   Max.   : 1.41890   Max.   : 1.580278   Max.   : 1.37524  
##       PC13               PC14               PC15         
##  Min.   :-1.08525   Min.   :-1.15657   Min.   :-1.20127  
##  1st Qu.:-0.25608   1st Qu.:-0.25322   1st Qu.:-0.19082  
##  Median : 0.04359   Median : 0.03605   Median : 0.01886  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.19056   3rd Qu.: 0.20133   3rd Qu.: 0.18024  
##  Max.   : 1.16017   Max.   : 1.43827   Max.   : 1.26639

Final Thoughts for PCA

Principal Component Analysis (PCA) is a valuable technique for dimensionality reduction, particularly in scenarios where datasets exhibit high dimensionality, such as this one. In the case of the Telco Churn dataset, which involves numerous features related to customer behavior and characteristics, PCA could be considered. PCA may offer benefits such as mitigating multicollinearity, handling high-dimensional data, and potentially revealing underlying patterns. However, it is crucial to weigh these advantages against the interpretability of the original features. The Telco Churn dataset involves variables that are directly tied to customer attributes and behaviors, making their interpretability paramount. Additionally, the dataset size and the desire to maintain the specific meaning of each feature may impact the suitability of PCA. As PCA assumes linear relationships between variables, it is essential to assess whether such assumptions align with the underlying patterns in the data. A thoughtful evaluation of the trade-offs between dimensionality reduction and interpretability should guide the decision on whether to employ PCA in the Telco Churn dataset analysis.

Hence, considering the Telco Churn dataset’s context, where the telecom company seeks detailed insights into the specific customer profiles associated with churn, opting against PCA may be a prudent choice. PCA inherently transforms original features into linear combinations, potentially diluting the individual interpretability of each variable. In this scenario, maintaining the fidelity of the original features becomes paramount, as the company aims to understand the nuanced characteristics contributing to customer churn. The absence of PCA allows for a more direct examination of the relationships between individual features and churn behavior, providing a clearer and more granular understanding of the factors influencing customer decisions. This decision aligns with the company’s strategic goal of tailoring retention strategies to the unique attributes of customers prone to churning, enhancing the efficacy of targeted interventions.

K-Means Cluster Model

Let’s start with finding the optimal number of clusters for this dataset. As instructed, we will use the raw data that is cleaned and scaled, and not the PCA dataset for our analysis.

Optimal Cluster Size

fviz_nbclust(telco_scaled, kmeans, method = "silhouette")

fviz_nbclust(telco_scaled, kmeans, method = "wss")

fviz_nbclust(telco.pca_df, kmeans, method = "gap_stat", nboot=10)

Here, we see that Silhoutte method recommends an optimal cluster size of 2. The WSS method also has an elbow at 2 clusters. However, the gap stat method recommends a cluster size of 3. Let’s visualize our cluster to see if 3 clusters make sense for our data.

K-Means Cluster Visualization - 2 Clusters

set.seed(12345)

km <- kmeans(telco_scaled, 3)

fviz_cluster(km, data = telco_scaled)  ## cluster visualization

Upon examining our clustered data, it becomes apparent that clusters 1 and 3 exhibit some degree of overlap, hinting at potential similarities or shared characteristics. Despite this, the overall distinctiveness of the clusters remains discernible.
Let’s visualize our cluster to see if 2 clusters make sense for our data.

K-Means Cluster Visualization

km2 <- kmeans(telco_scaled, 2)

fviz_cluster(km2, data = telco_scaled)  ## cluster visualization

The visual inspection of our clusters indeed reveals distinct patterns, suggesting meaningful segmentations within our dataset. However, given the substantial size of our dataset, opting for 3 clusters seems appropriate. This choice allows for a nuanced characterization of our customer base, leveraging the additional granularity provided by a higher number of clusters. The decision to proceed with three clusters is driven by the aim to glean richer insights into customer behavior, fostering a more detailed and comprehensive understanding that aligns with the diverse profiles present in our data.

K- Means Cluster Interpretation

For our interpretation and understanding of customer groups, let’s look at the raw numerical averages, before scaling, of different variables. This will help us build customer profiles based on different clusters and identify certain clusters that may be “churn-sensitive”.

table(km$cluster)
## 
##    1    2    3 
##  803 5512  717
telco_imp$cluster <- km$cluster

result <- telco_imp %>%
  group_by(cluster) %>%
  dplyr::summarize(across(where(is.numeric), mean, na.rm = TRUE),
            across(where(is.character), ~names(which.max(table(.))))
            )
print(result)
## # A tibble: 3 × 22
##   cluster SeniorCitizen tenure MonthlyCharges TotalCharges ChurnYes gender
##     <int>         <dbl>  <dbl>          <dbl>        <dbl>    <dbl> <chr> 
## 1       1        0.0399   20.3           20.6         425.   0.111  Male  
## 2       2        0.198    32.9           76.9        2730.   0.319  Male  
## 3       3        0.0279   42.3           21.6         934.   0.0335 Female
## # ℹ 15 more variables: Partner <chr>, Dependents <chr>, PhoneService <chr>,
## #   MultipleLines <chr>, InternetService <chr>, OnlineSecurity <chr>,
## #   OnlineBackup <chr>, DeviceProtection <chr>, TechSupport <chr>,
## #   StreamingTV <chr>, StreamingMovies <chr>, Contract <chr>,
## #   PaperlessBilling <chr>, PaymentMethod <chr>, Churn <chr>

Here, examining the average values within each cluster unveils distinctive cluster characteristics:

Cluster Count Key Insights
1 803 Customers Cluster 1 primarily comprises a relatively lower concentration of senior citizens and relatively newer customers with the shortest tenure. These individuals have the lowest average monthly charges compared to other clusters and also demonstrate the lowest average total charges, indicating minimal spending with the company. Typically, these customers are not enrolled in paperless billing, prefer mailed checks for payment, and often have a single phone line with no internet service through Telcom. They are likely to have a month-to-month contract and be single males without dependents or a partner.
2 5512 Customers Cluster 2 is characterized by the highest churn rate, a relatively higher concentration of senior citizens and slightly older customers with better tenure than those in Cluster 1. On average, these customers have the highest monthly charges and total charges compared to other clusters. They tend to prefer paperless billing and often pay through Electronic Check. Additionally, they typically have more than one phone line and subscribe to internet services through Telcom. These customers are likely to have a month-to-month contract and be single males without dependents or a partner.
3 717 Customers Cluster 3 exhibits the lowest concentration of senior citizens and the oldest customers with the highest tenure. On average, these customers have lower monthly charges compared to other clusters and comparatively lower total charges. Similar to Cluster 1, they are not typically enrolled in paperless billing and often pay through mailed checks. They usually have two year contracts and a single phone line and no internet service through Telcom. These customers are likely to be females with dependents or a partner.

Churn Likely Customers

In our analysis, we have identified Cluster 2 as having the highest churn rate among the three clusters. This cluster stands out as it comprises customers who exhibit certain characteristics that make them more prone to churn. Understanding the profile of churn-sensitive customers is crucial for Telcom to implement targeted strategies aimed at customer retention.

Cluster 2 Characteristics:

  • Demographics: This cluster shows a relatively higher concentration of senior citizens and customers with better tenure than those in other clusters.
  • Billing Preferences: Customers in this cluster tend to prefer paperless billing and often choose to pay through Electronic Check.
  • Service Subscriptions: Cluster 2 customers typically have more than one phone line and subscribe to internet services through Telcom.
  • Financial Behavior: On average, customers in this cluster have the highest monthly charges and total charges compared to other clusters.

This somewhat aligns with our initial hypothesis that senior citizen customers with high monthly charges are more likely to churn than other customers. We do see that our initial hypothesis that customers with high total charges are less inclined to churn is not correct, as per our k-means model. We also hypothesized that customers with month to month contract have higher churn tendency which is confirmed by our model.

Building the Churn-Sensitive Customer Profile:

Based on the characteristics of Cluster 2, Telcom can identify key attributes associated with customers more likely to churn: 1. Age Factor: Senior citizens and older customers within this cluster might be more sensitive to service changes or pricing, requiring tailored communication and incentives. 2. Billing and Payment Preferences: The preference for paperless billing and electronic payment methods indicates a technological inclination. Telcom can explore personalized digital communication channels for this segment. 3. Service Subscription Impact: Customers with multiple phone lines and internet services may have higher expectations. Offering targeted promotions or service enhancements can address their needs.

By recognizing these patterns and understanding the distinct behaviors of Cluster 2, Telcom can implement targeted retention strategies. This may include personalized communications, loyalty programs, or special offers to mitigate churn risk among customers with similar characteristics.

Confusion Matrix

Let’s test the accuracy and sensitivity of our model using a confusion matrix.

pred <- ifelse(telco_imp$cluster == 2, "Churn", "No Churn")
actual <- ifelse(telco_imp$ChurnYes == 1,  "Churn", "No Churn")

cfmatrix <- confusionMatrix(data = factor(pred), reference = factor(actual))
cfmatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Churn No Churn
##   Churn     1756     3756
##   No Churn   113     1407
##                                           
##                Accuracy : 0.4498          
##                  95% CI : (0.4381, 0.4615)
##     No Information Rate : 0.7342          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1308          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9395          
##             Specificity : 0.2725          
##          Pos Pred Value : 0.3186          
##          Neg Pred Value : 0.9257          
##              Prevalence : 0.2658          
##          Detection Rate : 0.2497          
##    Detection Prevalence : 0.7838          
##       Balanced Accuracy : 0.6060          
##                                           
##        'Positive' Class : Churn           
## 

Our model’s performance metrics are indicative of its effectiveness in capturing relevant patterns within the data. Specifically, the model exhibits a commendable sensitivity of 0.94, reflecting its ability to correctly identify instances of positive churn. However, it’s crucial to note that the accuracy stands at 0.45, underscoring the need for further exploration and potential refinement. The interplay between sensitivity and accuracy warrants a nuanced examination to optimize our model for a more balanced and reliable predictive outcome.

Conclusion

In this analysis, we applied the k-means clustering algorithm to the TelcoChurn dataset, which contains data on 7043 customers of a telecommunication company. The dataset has various variables that describe the customers’ service, billing, and payment information along with some usage data. The goal of this analysis is to explore the factors that influence customer churn and to provide recommendations for improving customer retention.

Before applying the k-means algorithm, we performed some data preprocessing steps, such as removing 11 customers with missing values for Total charges, converting the categorical variables into dummy variables, and scaling the numeric variables using min-max scaling. We then used the gap stat, silhoutte, and wss (elbow) method to determine the optimal number of clusters, which was 3. We assigned each customer to one of the three clusters based on their similarity to the cluster centroids.

We also analyzed the characteristics of each cluster and found some key insights that can help us understand the customer behavior and preferences. Based on these findings, we can make some recommendations for improving customer retention and satisfaction. For the first cluster, we suggest that the company should offer more incentives and discounts for loyal customers, such as free upgrades, rewards, or referrals. The company should also provide more options and flexibility for the internet service, such as different plans, speeds, or features. The company should also communicate more frequently and effectively with these customers, such as sending personalized emails, surveys, or feedback requests. For the second cluster, we suggest that the company should promote and educate the customers about the benefits and features of the internet service, such as convenience, entertainment, or security. The company should also offer some trial or introductory offers for the internet service, such as free or discounted periods, or bundles with other services. The company should also maintain and improve the quality and reliability of the phone service, such as reducing dropped calls, improving coverage, or providing customer support.

We evaluated the performance of our model using two metrics: accuracy and sensitivity. Accuracy measures how well the model can correctly classify the customers into their respective clusters, while sensitivity measures how well the model can identify the customers who churned. Our model achieved an accuracy of 44.98% and a sensitivity of 93.95%. This means that our model can correctly cluster about 45% of all the customers, and can detect about 94% of the customers who left the company.

In conclusion, the k-means clustering algorithm provided a useful way to segment the customers based on their characteristics and behavior. However, the model had a low accuracy rate, which suggests that there is room for improvement. A larger and more diverse dataset could help to train the model better and capture more variation in the customer behavior. We could also experiment with different clustering algorithms and models with a larger dataset.